Application development for distributed computing "Grids" can benefit from tools 
that variously hide or enable application-level management of critical aspects of the het- 
erogeneous environment. As part of an investigation of these issues, we have developed 
MPICH-G2, a Grid-enabled implementation of the Message Passing Interface (MPI) that 
allows a user to run MPI programs across multiple computers, at the same or different 
sites, using the same commands that would be used on a parallel computer. This library 
extends the Argonne MPICH implementation of MPI to use services provided by the 
Globus Toolkit for authentication, authorization, resource allocation, executable staging, 
and I/O, as well as for process creation, monitoring, and control. Various performance- 
critical operations, including startup and collective operations, are configured to exploit 
network topology information. The library also exploits MPI constructs for performance 
management; for example, the MPI communicator construct is used for application-level 
discovery of, and adaptation to, both network topology and network quality-of-service 
mechanisms. We describe the MPICH-G2 design and implementation, present perfor- 
mance results, and review application experiences, including record-setting distributed 
simulations. 
1. INTRODUCTION 



So-called computational Grids |T£| [lj] enable the coupling and coordinated use 
of geographically distributed resources for such purposes as large-scale computation, 
distributed data analysis, and remote visualization. The development or adapta- 
tion of applications for Grid environments is made challenging, however, by the 
often heterogeneous nature of the resources involved and the fact that these 
resources typically live in different administrative domains, run different software, 
are subject to different access control policies, and may be connected by networks 
with widely varying performance characteristics. 

Such concerns have motivated various explorations of specialized, often high- 
level, distributed programming models for Grid environments, including various 
forms of object systems [^6|, Web technologies [|2| ^0|, problem solving en- 
vironments (?], ^5), CORBA, workflow systems, high-throughput computing sys- 
tems [0, [39|, and compiler-based systems p3[ . 

In contrast, we explore here a different approach that might appear reactionary 
in its simplicity but that, in fact, delivers a remarkably sophisticated technology 
for managing the heterogeneity associated with Grid environments. Specifically, we 
advocate the use of a well-known low-level parallel programming model, the Message 
Passing Interface (MPI), as a basis for Grid programming. While not a high-level 
programming model by any means, MPI incorporates sophisticated support for the 
management of heterogeneity (e.g., data types), for the construction of modular 
programs (the communicator construct), for management of latency (asynchronous 
operations), and for the representation of global operations (collective operations). 
These and other features have allowed MPI to achieve tremendous success as a 
standard programming model for parallel computers. We hypothesize that these 
same features can also be used to good effect for Grid computing. 

Our investigation of MPI as a Grid programming model has focused on three 
related questions. First, can we implement MPI constructs efficiently in Grid en- 
vironments to hide heterogeneity without introducing overhead? Second, can we 
use MPI constructs to enable users to manage heterogeneity, when this is required? 
Third, do users find MPI useful in practice for application development? 

To allow for the experimental exploration of these questions, we have devel- 
oped MPICH-G2, a complete implementation of the MPI-1 standard jl2| that uses 
services provided by the Globus Toolkit™ |l7|] to extend the popular Argonne 
MPICH implementation of MPI |§7) for Grid execution. MPICH-G2 passes the 
MPICH test suite and represents a complete redesign and reimplementation of the 
earlier MPICH-G system |L5| that increases performance significantly and incorpo- 
rates a number of innovations. Our experiences with MPICH-G2, as reported in 
this article, allow us to respond in the affirmative to each question posed in the 
preceding paragraph. 

MPICH-G2 hides heterogeneity by using Globus Toolkit services for such pur- 
poses as authentication, authorization, executable staging, process creation, process 
monitoring, process control, communication, redirection of standard input and out- 
put, and remote file access. The result is that a user can run MPI programs across 
multiple computers at different sites using the same commands that would be used 
on a parallel computer. Furthermore, performance studies show that overheads 
relative to native implementations of basic communication functions are negligible. 

MPICH-G2 enables the use of several different MPI features for user management 
of heterogeneity. MPI's asynchronous operations can be used for latency 
management in wide-area networks. MPI's communicator construct can be used to 
represent the hierarchical structure of heterogeneous systems and thus allow appli- 
cations to adapt their behavior to such structures. (In separate work, we present 
topology-aware collective operations as one example of an "application" p2[.) We 
also show how MPI's communicator construct can be used for user-level management 
of network quality of service, as first introduced in an earlier article [47]. 

Many groups have used MPICH-G2 for the execution of both traditional parallel 
computing applications (e.g., numerical simulation) and nontraditional distributed 
computing applications (e.g., distributed visualization), in both local-area and wide- 
area networks. This variety of applications and execution environments persuades 
us that MPI can play a valuable role in Grid computing. 

MPICH-G2 is not the only implementation of MPI for heterogeneous systems. 
Others include MPICH with the ch_p4 device (which provides limited support for 
heterogeneity), PACX-MPI, and STAMPI, each of which has interesting 
features, as we discuss later. Magpie, IMPI fl3i|, and PVM [25] also address 
relevant issues. MPICH-G2 is unique, however, in the degree to which it hides and 
manages heterogeneity, as well as in its large user community. 

In the rest of this article, we describe the problems that we faced in developing 
MPICH-G2, the techniques used to overcome these problems, and experimental 
results that indicate the performance of the MPICH-G2 implementation and the 
extent of its improvement over MPICH-G. We conclude with a discussion of appli- 
cation experiments and future directions. 

2. BACKGROUND 

We first provide some brief background on MPI, MPICH, and the Globus 
Toolkit. 

2.1. Message Passing Interface 

The Message Passing Interface standard defines a library of routines that im- 
plement the message-passing model. These routines include point-to-point commu- 
nication functions, in which a send operation is used to initiate a data transfer 
between two concurrently executing program components and a matching receive 
operation is used to extract that data from system data structures into application 
memory space; and collective operations such as broadcast and reductions that im- 
plement operations involving multiple processes. Numerous other functions address 
other aspects of message passing, including, in the MPI-2 extensions to MPI p3[ |, 
single-sided communication and dynamic process creation. 

The primary interest of MPI from our perspective, apart from its broad adop- 
tion, is the care taken in its design to ensure that underlying performance issues 
are accessible to, not masked from, the programmer. MPI mechanisms such as 
asynchronous operations, communicators, and collective operations all turn out to 
be useful in Grid environments. 

2.2. MPICH Architecture 

MPICH [^9| is a popular implementation of the Message Passing Interface stan- 
dard. It is a high-performance, highly portable library originally developed as a 
collaborative effort between Argonne National Laboratory and Mississippi State 
University. Argonne continues research and development efforts aimed at improv- 
ing MPICH performance and functionality. 

In its present form, MPICH is a complete implementation of the MPI-1 standard 
with extensions to support the parallel I/O functionality defined in the MPI-2 stan- 
dard. It is a mature, widely distributed library, with more than 2,000 downloads 
per month, not including downloads that occur at mirror sites. Its free distribu- 
tion and wide portability have contributed materially to the adoption of the MPI 
standard by the parallel computing community. 

MPICH derives its portability from its interfaces and layered architecture. At 
the top is the MPI interface as defined by the MPI standards. Directly beneath 
this interface is the MPICH layer, which implements the MPI interface. Much of 
the code in an MPI implementation is independent of the networking device or 
process management system. This code, which includes error checking and various 
manipulations of the opaque objects, is implemented directly at the MPICH layer. 
All other functionality is passed off to lower layers by means of the Abstract Device 
Interface (ADI). 

The ADI is a simpler interface than MPI proper and focuses on moving data be- 
tween the MPI layer and the network subsystem. Those interested in implementing 
MPI for a particular platform need only define the routines in the ADI in order to 
obtain a full implementation. Existing implementations of this device interface for 
various MPPs, SMPs, and networks provide complete MPI functionality in a wide 
variety of environments. MPICH-G2 is another implementation of the ADI and is 
otherwise known as the globus2 device. 

2.3. The Globus Toolkit 

The Globus Toolkit is a collection of software components designed to support 
the development of applications for high-performance distributed computing environments, 
or "Grids" [17]. Core components typically define a protocol for interacting 
with a remote resource, plus an application program interface (API) used to 
invoke that protocol. (We introduce the protocols and APIs used within MPICH-G2 
below.) Higher-level libraries, services, tools, and applications use core services to 
implement more complex global functionality. The various Globus Toolkit compo- 
nents are reviewed in plj and described in detail in online documentation and in 
technical papers. 

3. MPICH-G2: A GRID-ENABLED MPI 

As noted in the introduction, MPICH-G2 is a complete implementation of the 
MPI-1 standard that uses Globus Toolkit services to support efficient and transpar- 
ent execution in heterogeneous Grid environments, while also allowing for applica- 
tion management of heterogeneity. (It also implements client/server management 
functions found in Section 5.4 of the MPI-2 standard [fl3). However, we do not 
discuss these functions here.) 

In this section, we first describe the techniques used to hide heterogeneity during 
startup and for process management, then the techniques used to effect communica- 
tion in heterogeneous systems, and finally the support provided for application-level 
management of heterogeneity. 



[Figure 1 diagram omitted. Commands shown: "% grid-proxy-init" and 
"% mpirun -np 256 myprog". Components shown: MDS, GRAM, mpirun, globusrun, 
DUROC, GASS, and local schedulers. Labeled steps: locates hosts; authenticates; 
generates resource specification; stages executables; submits multiple jobs; 
coordinates startup; initiates job; detects termination; communicates via 
vendor MPI and TCP/IP (globus-io).] 

FIG. 1 Schematic of the MPICH-G2 startup, showing the various Globus 
Toolkit components used to hide and manage heterogeneity. "Fork," "LSF," and 
"LoadLeveler" are different local schedulers. 



3.1. Hiding Heterogeneity during Startup and Management 

As illustrated in Figure 1 and discussed here, MPICH-G2 uses a range of Globus 
Toolkit services to address the various complex issues that arise in heterogeneous, 
multisite Grid environments, such as cross-site authentication, the need to deal 
with multiple schedulers with different characteristics, coordinated process creation, 
heterogeneous communication structures, executable staging, and collation of stan- 
dard output. In fact, MPICH-G2 serves as an exemplary case study of how Globus 
Toolkit mechanisms can be used to create a Grid-enabled programming tool, as we 
now explain. 

Prior to startup of an MPICH-G2 application, the user employs the Grid Security 
Infrastructure (GSI) JlSf ] to obtain a (public key) proxy credential that is used 
to authenticate the user to each remote site. This step provides a single sign-on 
capability. 

The user may also use the Monitoring and Discovery Service (MDS) 1 13 to select 
computers on the basis of, for example, configuration, availability, and network 
connectivity. 

Once authenticated, the user uses the standard mpirun command to request 
the creation of an MPI computation. The MPICH-G2 implementation of this com- 
mand uses the Resource Specification Language (RSL) jio) to describe the job. In 
brief, users write RSL scripts, which identify resources (e.g., computers) and specify 
requirements (e.g., number of CPUs, memory, execution time, etc.) and parame- 
ters (e.g., location of executables, command line arguments, environment variables, 
etc.) for each. Based on the information found in an RSL script, MPICH-G2 calls a 
coallocation library distributed with the Globus Toolkit, the Dynamically-Updated 
Request Online Coallocator (DUROC) [11], to schedule and start the application 
across the various computers specified by the user. 

FIG. 2 An example of an MPICH-G2 application running on a computational grid 
involving 4 processes on an IBM SP at Site A and 8 processes distributed evenly 
across two Linux clusters at Site B. 

The DUROC library itself uses the Grid Resource Allocation and Management 
(GRAM) Jl(| API and protocol to start and subsequently manage a set of subcom- 
putations, one for each computer. For each subcomputation, DUROC generates 
a GRAM request to a remote GRAM server, which authenticates the user, per- 
forms local authorization, and then interacts with the local scheduler to initiate 
the computation. DUROC and associated MPICH-G2 libraries tie the various sub- 
computations together into a single MPI computation. 

GRAM will, if directed, use Global Access to Secondary Storage (GASS) j^] to 
stage executable(s) from remote locations (indicated by URLs). GASS is also used, 
once an application has started, to direct standard output and error (stdout and 
stderr) streams to the user's terminal, and to provide access to files regardless of 
location, thus masking essentially all aspects of geographical distribution except 
those associated with performance. 

Once the application has started, MPICH-G2 selects the most efficient communication 
method possible between any two processes, using vendor-supplied MPI 
(vMPI) if available, or Globus communication (Globus IO) with Globus Data 
Conversion (Globus DC) for TCP, otherwise. 

DUROC and GRAM also interact to monitor and manage the execution of 
the application. Each GRAM server monitors the life cycle of its subcomputation 
as it passes from pending to running and then to terminating, communicating 
each state transition back to DUROC. Each subcomputation is held at a DUROC- 
controlled barrier and is released from that barrier only after all subcomputations 
have started executing. Also, a request to terminate the computation ("control 
C") may be initiated by the user at which time DUROC and the GRAM servers, 
communicating via GRAM process control messages, terminate all processes. 
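
The startup barrier described in this paragraph can be pictured with a small sketch. It is not DUROC or GRAM code: the type sub_state and the functions report_transition and barrier_may_release are hypothetical names introduced only for illustration, and the states simply mirror the pending, running, and terminating life cycle mentioned in the text.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical life-cycle states of one subcomputation, following the text. */
typedef enum { SUB_PENDING, SUB_RUNNING, SUB_TERMINATING } sub_state;

/* Record a state report for subcomputation idx; states only move forward. */
static void report_transition(sub_state *states, size_t idx, sub_state next)
{
    if (next > states[idx])
        states[idx] = next;
}

/* Assumption: "all subcomputations have started" is taken to mean that
 * every subcomputation is currently in the running state. */
static bool barrier_may_release(const sub_state *states, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (states[i] != SUB_RUNNING)
            return false;
    return n > 0;
}
```

In this sketch a coordinator would call report_transition each time a state change arrives and release the barrier only once barrier_may_release returns true.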

After the processes have started, MPICH-G2 uses information specified in the 
RSL script to create multilevel clustering of the processes based on the underlying 
network topology. Figure 2 depicts an MPI application involving 12 processes 
distributed across three machines located at two sites. We depict 4 processes 
(MPI_COMM_WORLD ranks 0-3) on the IBM SP at Site A and 4 processes on each of 
two Linux clusters (MPI_COMM_WORLD ranks 4-7 and 8-11, respectively) at Site B. 
Each process in MPI_COMM_WORLD is assigned a topology depth. Processes that 
communicate using only TCP are assigned topology depths of 3 (to distinguish between 
wide area, local area, and intramachine TCP messaging), and processes that can 
also communicate using a vMPI have a topology depth of 4. Using these topology 
depths, MPICH-G2 groups processes at a particular level through the assignment 
of colors. Two processes are assigned the same color at a particular level if they 
can communicate with each other at the network level. 

FIG. 3 An example of depths and colors used by MPICH-G2 to represent network 
topology in a computational grid. 

Figure 3 depicts the topology depths and colors for the processes depicted in 
Figure 2. Those processes capable of communicating over vMPI (i.e., those executing 
on the IBM SP) have a depth of 4, while the other processes (i.e., those executing 
on a Linux cluster) have a depth of 3. Since all processes are on the same wide-area 
network, they all have the same color (0) at the wide-area level. Similarly, at the 
local-area level, all the processes at Site A are assigned one color (0), while all the 
processes at Site B are assigned another (1). This structure continues through the 
system-area level, where processes are assigned the same color if and only if they 
are on the same machine. Finally, processes that can communicate over a vMPI 
are assigned the same color at the vMPI level if and only if they can communicate 
directly with each other over the vMPI. 
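
The depth and color rules just described can be sketched in a few lines of C. This is an illustration only, not MPICH-G2 source: the struct proc_loc, the constant NO_COLOR, and the functions assign_colors and example_loc are hypothetical, and the sketch assumes one vendor MPI job per machine, so that the machine index can double as the vMPI color.

```c
#define MAX_DEPTH 4    /* wide area, local area, system area, vMPI */
#define NO_COLOR  (-1) /* hypothetical marker for a level the process lacks */

/* Hypothetical description of where a process runs. */
typedef struct {
    int site;     /* index of the site (0 = Site A, 1 = Site B) */
    int machine;  /* globally unique index of the machine */
    int has_vmpi; /* nonzero if a vendor MPI is available */
} proc_loc;

/* Fill colors[0..3] for one process and return its topology depth. */
static int assign_colors(const proc_loc *p, int colors[MAX_DEPTH])
{
    colors[0] = 0;           /* wide area: all processes share one color */
    colors[1] = p->site;     /* local area: same site, same color */
    colors[2] = p->machine;  /* system area: same machine, same color */
    /* Assumption: one vendor MPI job per machine. */
    colors[3] = p->has_vmpi ? p->machine : NO_COLOR;
    return p->has_vmpi ? 4 : 3;
}

/* Layout of the 12-process example: ranks 0-3 on the IBM SP at Site A,
 * ranks 4-7 and 8-11 on two Linux clusters at Site B. */
static proc_loc example_loc(int rank)
{
    proc_loc p;
    p.machine  = rank / 4;
    p.site     = (rank < 4) ? 0 : 1;
    p.has_vmpi = (rank < 4);
    return p;
}
```

Under these assumptions, every rank on the IBM SP gets depth 4 and colors 0, 0, 0, 0, while a rank on the second Linux cluster gets depth 3 and colors 0, 1, 2 with no vMPI color.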

Topology depths and colors are used in the multilevel topology-aware collective 
operations and topology-discovery mechanism described in Sections 3.2 and 3.3, 
respectively. 



3.2. Heterogeneous Communications 

MPICH-G2 achieves major performance improvements relative to the earlier 
MPICH-G |ll| by replacing Nexus (^0|, the multimethod, single-sided communication 
library used for all communication in MPICH-G, with specialized MPICH-specific 
communication code. While Nexus has attractive features (e.g., multiprotocol 
support with highly tuned TCP support and automatic data conversion), other 
attributes have proved less attractive from a performance perspective. MPICH-G2 
now handles all communication directly by reimplementing the good things about 
Nexus and improving the others. The result, as we show in Section 4, is that 
we achieve performance virtually identical to vendor MPI and MPICH configured 
with the default TCP (ch_p4) device. We provide here a detailed description of the 
improvements and additions to MPICH-G used to achieve this impressive performance. 



Increased bandwidth. In MPICH-G, each communication involved the copying 
of data to and from Nexus buffers in sending and receiving processes. MPICH-G2 
eliminates these two extra copies in the case of intramachine messages where a 
vendor MPI exists. In this situation, sends and receives now flow directly from 
and to application buffers, respectively. In addition, for TCP messaging involving 
basic MPI datatypes (e.g., MPI_INT, MPI_FLOAT) the sending process also transmits 
directly from the application buffer. 

Reduced latency for intramachine vendor MPI messaging. Multiprotocol support 
is achieved in Nexus by polling each protocol (TCP, vendor MPI, etc.) for 
incoming messages in a round-robin fashion [fl6||. However, this strategy is 
inefficient in many situations: it is relatively expensive to poll a TCP socket, and in 
practice it is often the case that many processes in an MPICH-G2 computation use 
only vendor MPI (for communicating with other processes on the same machine). 

While this inefficiency can be reduced by adaptive polling or by introducing 
distinct proxy processes |2^, [56|, MPICH-G2 takes a more direct approach, exploit- 
ing the knowledge about message source that is provided by TCP receive commands 
to eliminate TCP polling altogether in many situations. MPICH-G2 polls TCP only 
when the application is expecting data from a source that dictates, or might dictate 
(e.g., MPI_Recv specifies source=MPI_ANY_SOURCE), TCP messaging. 

This avoidance of unnecessary polling when coupled with the need to guarantee 
progress on both the vendor MPI and TCP protocols leads to implementation de- 
cisions that can affect an application's point-to-point communication performance. 
Specifically, for processes executing on machines where a vendor MPI is available, 
the context in which the application calls MPI_Recv affects the manner in which 
MPICH-G2 implements that function, as follows: 

• Specified. The source rank specified in the call to MPI_Recv explicitly identifies 
a process on the same machine (in the same vendor MPI job). Furthermore, 
no asynchronous requests are outstanding (e.g., incomplete MPI_Irecv 
and/or MPI_Isend). If these two conditions are met, MPICH-G2 implements 
MPI_Recv by directly calling the MPI_Recv of the underlying vendor MPI. 
These are the most favorable circumstances under which an MPI_Recv can be 
performed. 

• Specified-pending. This category is similar to the specified category in that 
the MPI_Recv specifies an explicit source rank on the same machine. This 
time, however, one or more unsatisfied receive requests are present, and each 
such request specifies a source on the same machine. This situation forces 
MPICH-G2 to continuously poll (MPI_Iprobe) the vendor MPI for incoming 
messages. This scenario results in less efficient MPICH-G2 performance since 
the induced polling loop increases latency. 

• Multimethod. Here the source rank for the MPI_Recv is MPI_ANY_SOURCE or 
MPI_Recv is called in the presence of unsatisfied asynchronous requests that 
require, or might require, TCP messaging. In this situation, MPICH-G2 must 
poll both TCP and the vendor MPI continuously. This is the least efficient 
MPICH-G2 scenario, since the relatively large cost of TCP polling results in 
even greater latency. 

In Section 4, we present a quantitative analysis of the performance differences that 
result from these different structures. 
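
The three receive categories can be summarized as a small decision function. The sketch below is our own illustration and not the MPICH-G2 progress engine: the struct recv_context, its fields, and the function classify_recv are hypothetical, ANY_SOURCE stands in for MPI_ANY_SOURCE, and a receive from a specified source on another machine is folded into the multimethod case as a simplifying assumption.

```c
#include <stdbool.h>

#define ANY_SOURCE (-1) /* stand-in for MPI_ANY_SOURCE in this sketch */

typedef enum {
    RECV_SPECIFIED,         /* call the vendor MPI_Recv directly */
    RECV_SPECIFIED_PENDING, /* poll the vendor MPI only */
    RECV_MULTIMETHOD        /* poll both TCP and the vendor MPI */
} recv_mode;

/* Hypothetical summary of the state consulted when a receive is posted. */
typedef struct {
    int  source;            /* requested source rank, or ANY_SOURCE */
    bool source_is_local;   /* source is in the same vendor MPI job */
    int  pending_local;     /* outstanding requests confined to this machine */
    int  pending_maybe_tcp; /* outstanding requests that need, or might need, TCP */
} recv_context;

static recv_mode classify_recv(const recv_context *c)
{
    /* Assumption: a specified but remote source is treated as multimethod. */
    if (c->source == ANY_SOURCE || !c->source_is_local ||
        c->pending_maybe_tcp > 0)
        return RECV_MULTIMETHOD;
    if (c->pending_local > 0)
        return RECV_SPECIFIED_PENDING;
    return RECV_SPECIFIED;
}
```

The ordering of the checks reflects the cost ranking in the list: any condition that might require TCP forces the most expensive path, and the direct vendor call is chosen only when nothing else is outstanding.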
More efficient use of sockets. The Nexus single-sided communication paradigm 
results in MPICH-G opening two pairs of sockets between communicating processes 
and using each pair as a simplex channel (i.e., data always flowing in one 
direction over each socket pair). MPICH-G2 opens a single pair of sockets between 
two processes and sends data in both directions. This approach reduces the use of 
system resources; moreover, by using sockets in the bidirectional manner in which 
they were intended, it also improves TCP efficiency. 



Multilevel topology-aware collective operations. Early implementations of MPI's 
collective operations sought to construct communication structures that were optimal 
under the assumption that all processes were equidistant from one another [Q, 
P|. Since this assumption is unlikely to be valid in Grid environments, however, 
it is desirable that a Grid-enabled MPI incorporate collective operation implementations 
that take into account the actual topology. MPICH-G2 does this, and 
we have demonstrated substantial performance improvements for our multilevel 
topology-aware approach [52] relative both to topology-unaware binomial trees and 
earlier topology-aware approaches that distinguish only between "intracluster" and 
"intercluster" communications [30, 35]. 

As we explain in the next subsection, MPICH-G2's topology-aware collective 
operations are constructed in terms of topology discovery mechanisms that can 
also be used by topology-aware applications. 



3.3. Application-Level Management of Heterogeneity 

We have experimented within MPICH-G2 with a variety of mechanisms for 
application-level management of heterogeneity in the underlying platform. We 
mention two here. 



Topology discovery. Once an MPI program starts, all processes can be viewed 
as equivalent, distinguished only by their rank. This level of abstraction is desirable 
from a programming viewpoint but makes it difficult to write programs that exploit 
aspects of the underlying physical topology, for example, to minimize expensive 
intercluster communications. 

MPICH-G2 addresses this issue within the standard MPI framework by using 
the MPI communicator construct to deliver topology information to an application. 
It associates attributes with each MPI communicator to communicate this topology 
information, which is expressed within each process in terms of topology depths and 
colors, as described in Section 3.1. 



MPICH-G2 applications can then query communicators to retrieve attribute 
values and structure themselves appropriately. For example, it is straightforward 
to create new communicators that reflect the underlying network topology. Figure 4 
depicts an MPICH-G2 application that first queries the MPICH-G2-defined 
communicator attributes MPICHX_TOPOLOGY_DEPTHS and MPICHX_TOPOLOGY_COLORS to 
discover topology depths and colors, respectively, and then uses those values to 
create three communicators: LANcomm, which groups processes based on site 
boundaries; VcommA, which groups processes based on their ability to communicate with 
each other over vMPI, while placing all processes that cannot communicate over 
vMPI into a separate communicator; and VcommB, which groups the processes in 
much the same way as VcommA, but this time does not place processes that cannot 
communicate over vMPI in a communicator (i.e., VcommB is set to MPI_COMM_NULL 
for those processes). 

#include <mpi.h> 

int main(int argc, char *argv[]) 
{ 
    int me, flag; 
    int *depths;    /* depths[i]: topology depth of rank i */ 
    int **colors;   /* colors[i][level]: color of rank i at that level */ 
    MPI_Comm LANcomm, VcommA, VcommB; 

    MPI_Init(&argc, &argv); 
    MPI_Comm_rank(MPI_COMM_WORLD, &me); 

    /* Retrieve the topology attributes attached to the communicator. */ 
    MPI_Attr_get(MPI_COMM_WORLD, MPICHX_TOPOLOGY_DEPTHS, &depths, &flag); 
    MPI_Attr_get(MPI_COMM_WORLD, MPICHX_TOPOLOGY_COLORS, &colors, &flag); 

    /* Group by local-area (site) color. */ 
    MPI_Comm_split(MPI_COMM_WORLD, colors[me][1], 0, &LANcomm); 

    /* Group by vMPI color; processes without vMPI share color -1. */ 
    MPI_Comm_split(MPI_COMM_WORLD, 
                   (depths[me] == 4 ? colors[me][3] : -1), 
                   0, &VcommA); 

    /* Group by vMPI color; processes without vMPI get MPI_COMM_NULL. */ 
    MPI_Comm_split(MPI_COMM_WORLD, 
                   (depths[me] == 4 ? colors[me][3] : MPI_UNDEFINED), 
                   0, &VcommB); 

    MPI_Finalize(); 
    return 0; 
} 

FIG. 4 An example MPICH-G2 application that uses topology depths and colors 
to create communicators that group processes into various topology-aware clusters. 



Quality-of-service management. We have experimented with similar techniques 
for purposes of quality-of-service management [47]. When running over a shared 
network, an MPI application may wish to negotiate with an external resource man- 
agement system to obtain dedicated access to (part of) the network. We show that 
communicator attributes can be used to set and initiate quality-of-service parame- 
ters between selected processes. 



4. PERFORMANCE EXPERIMENTS 



We present the results of detailed performance experiments that characterize 
the performance of MPICH-G2 and demonstrate the major improvements achieved 
relative to its predecessor, MPICH-G. We begin by looking at the performance of 
intramachine communication over a vendor MPI. Then, we examine performance 
when TCP is the only choice for communicating between a pair of processes. In all 
cases, mpptest [28], the performance tool included in the MPICH distribution, is 
used to obtain the results. 



4.1. Vendor MPI 

Evaluating the performance of MPICH-G2 when using a vendor MPI as an 
underlying communication mechanism is not as simple as running a single set of 
ping-pong tests. As discussed earlier, the performance achieved by MPICH-G2 can 
be affected by outstanding requests and by the use of MPI_ANY_SOURCE. Therefore, 
we have divided the experiments into the three categories described in Section 3.2. 

FIG. 5 vMPI experiments - small message latency. 

Our vendor MPI experiments were run on an SGI Origin2000 at Argonne Na- 
tional Laboratory. Both MPICH-G2 and MPICH-G were built using a nonthreaded, 
no-debug flavor of Globus 1.1.4 and performed intramachine communication via 
SGI's implementation of MPI. 

One MPICH-G2 design goal was to minimize latency overhead for intramachine 
communication relative to an underlying vendor MPI. As can be seen in Figure 5, 
MPICH-G2 does an outstanding job in this regard: only a few extra microseconds 
of latency are introduced by MPICH-G2 when the source of the message is specified 
and no other requests are outstanding. In contrast, MPICH-G added approximately 
80 microseconds of latency to each message, because of the multiple steps required to 
implement the Nexus single-sided communication model. 

The introduction of pending receive requests has a modest impact on MPICH-G2 
message latencies. Messages falling into the specified-pending category incur slightly 
more overhead, as the MPICH-G2 progress engine must continuously poll (probe) 
the vendor MPI rather than blocking in a receive. Overall, MPICH-G2 latencies 
increase by several microseconds relative to the first case but are still far less than 
those of MPICH-G. 

The use of MPI_ANY_SOURCE has the largest impact on MPICH-G2 performance. 
The additional cost is associated with having to poll TCP as well as the vendor 
MPI. Polling TCP increases the latency of messages by nearly 20 microseconds over 
those in the specified-pending category. While the increase is significant, 
these latencies are still considerably less than for MPICH-G. 

While MPICH-G2 message latencies are affected by the use of MPI_ANY_SOURCE 
and pending receive requests, the realized bandwidths are largely unaffected. 
Figure 6 shows the bandwidths obtained for messages up to one megabyte. We see 
that the bandwidths for MPICH-G2 are nearly identical for all but small messages. 
While the large message bandwidths for MPICH-G2 are approximately 7% less 
than those for the vendor MPI (for reasons we do not yet understand), they 
represent an improvement of more than 60% over MPICH-G. 

FIG. 6 vMPI experiments - realized bandwidth. 



4.2. TCP/IP 

Performance optimization work on MPICH-G2 performed to date has focused 
on intramachine messaging when a vendor MPI is used as the underlying com- 
munication mechanism. The MPICH-G2 TCP/IP communication code has not 
been optimized. However, its performance is quite reasonable when compared with 
MPICH-G and with MPICH configured with the default TCP (ch_p4) device. 

All TCP/IP performance measurements were taken using a pair of SUN workstations 
in Argonne's Mathematics and Computer Science Division. These two machines 
were connected to a local-area network via gigabit Ethernet. Both MPICH-G 
and MPICH-G2 were built using a nonthreaded, no-debug flavor of Globus 1.1.4. 

Figure 7 shows the small message latencies exhibited by all three systems. We 
see that for most message sizes, MPICH-G2 is 20% to 30% slower than MPICH/ch_p4, 
although the difference is much smaller for very small messages. We also see that 
MPICH-G2 latencies, in most cases, are somewhat less than those of MPICH-G. 

FIG. 7 TCP/IP experiments - small message latency. 



The most notable data point is barely visible on the graph but emphasizes a 
clear optimization that is missing in MPICH-G2. The latency for zero-byte mes- 
sages is 140 microseconds, while the latency for an eight-byte message is 224 mi- 
croseconds. The reason for this large difference is that MPICH-G2 currently uses 
separate system calls to send the message header and the message data. This data 
point suggests that by combining these two writes into a single vector write, we 
could reduce the latency of small messages significantly. While this difference might 
seem unimportant for machines separated by a wide-area network, it can be significant 
when MPICH-G2 is used to combine multiple machines within the same machine 
room or even at the same site. 
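
The optimization suggested by the 84-microsecond gap between the zero-byte and eight-byte latencies (224 versus 140 microseconds) can be sketched as follows. This is an illustration only, assuming a POSIX system: the 4-byte length header, the function names, and the use of a pipe in place of a TCP socket are hypothetical choices, and short writes are not handled.

```python
import os
import struct

HEADER = struct.Struct("!I")  # hypothetical 4-byte header holding the payload length


def send_two_writes(fd: int, payload: bytes) -> int:
    """Send header and payload with separate system calls; return the call count."""
    os.write(fd, HEADER.pack(len(payload)))
    calls = 1
    if payload:  # a zero-byte message consists of the header alone
        os.write(fd, payload)
        calls += 1
    return calls


def send_vectored(fd: int, payload: bytes) -> int:
    """Send header and payload with one gathered write; return the call count."""
    buffers = [HEADER.pack(len(payload))]
    if payload:
        buffers.append(payload)
    os.writev(fd, buffers)  # one system call, no copy into a combined buffer
    return 1


def _read_exact(fd: int, count: int) -> bytes:
    """Read exactly count bytes, looping over short reads."""
    chunks = []
    while count > 0:
        chunk = os.read(fd, count)
        if not chunk:
            raise EOFError("descriptor closed before the message was complete")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


def recv_message(fd: int) -> bytes:
    """Read one length-prefixed message."""
    (length,) = HEADER.unpack(_read_exact(fd, HEADER.size))
    return _read_exact(fd, length)
```

Both send paths put the same bytes on the wire; the vectored path simply reaches the kernel once per message instead of twice, which is the saving the measurements point to.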

Figure 8 shows the bandwidths obtained by all three systems for message sizes 
up to one megabyte. For large messages, we see that MPICH-G2 performs approx- 
imately 5% better than the other two systems. This improvement is a result of the 
message data being sent directly from the user buffer rather than being copied into 
a separate buffer before write is called. For preposted receives with contiguous 
data, further improvement is possible. Data for these receives can be read directly 
into the user buffer, avoiding a buffer copy that, at present, always takes place at 
the receiver. 
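
The two buffer-handling ideas in this paragraph, sending straight from the user buffer and receiving straight into a preposted contiguous buffer, can be illustrated with standard socket calls. This is a minimal sketch, not the MPICH-G2 implementation; the function names are hypothetical, and "no copy" here refers only to user-space staging buffers (the kernel still copies data to and from its own buffers).

```python
import socket


def send_from_user_buffer(sock: socket.socket, user_buffer: bytearray) -> None:
    """Hand the caller's buffer to the socket layer without a staging copy."""
    sock.sendall(memoryview(user_buffer))


def recv_into_user_buffer(sock: socket.socket, user_buffer: bytearray) -> None:
    """Fill a preposted, contiguous buffer in place instead of staging the data."""
    view = memoryview(user_buffer)
    received = 0
    while received < len(view):
        count = sock.recv_into(view[received:])
        if count == 0:
            raise ConnectionError("peer closed before the buffer was filled")
        received += count
```

A receiver that knows the destination buffer in advance can call `recv_into_user_buffer` directly; a receiver without a preposted buffer has to read into temporary storage first and copy later, which corresponds to the copy that the text says currently always takes place.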

5. APPLICATION EXPERIENCES 

MPICH-G2 has been used by many groups worldwide for a wide variety of 
purposes. Here we mention a few relevant experiences that highlight interesting 
features of the system. 

FIG. 8 TCP/IP experiments - realized bandwidth. 



One interesting use of MPICH-G2 is to run conventional MPI programs across 
multiple parallel computers within the same machine room. In this case, MPICH-G2 
is used primarily to manage startup and to achieve efficient communication via use 
of different low-level communication methods. Other groups are using MPICH-G2 
to distribute applications across computers located at different sites, for exam- 
ple, Taylor performing MM5 climate modeling on the NSF TeraGrid fl6| ], 
Mahinthakumar forming multivariate geographic clusters to produce maps of re- 
gions of ecological similarity |Il| , Larsson for studies of distributed execution of a 
large computational electromagnetics code [Q, and Chen and Taylor in studies of 
automatic partitioning techniques, as applied to finite element codes (8). 

MPICH-G2 has also been successfully used in demonstrations that promote MPI 
as an application-level interface to Grids for nontraditional distributed computing 
applications, for example, Roy et al. for studies in using MPI idioms for setting 
QoS parameters fl7| and Papka and Binns for creating distributed visualization 
pipelines using MPICH-G2's client/server MPI-2 extensions J4§, |§]. 

MPICH-G2 was awarded a 2001 Gordon Bell Award for its role in an astrophysics application 
used for solving problems in numerical relativity to study gravitational waves from 
colliding black holes gj. The winning team used MPICH-G2 to run across four 
supercomputers in California and Illinois, achieving scaling of 88% (1,140 CPUs) 
and 63% (1,500 CPUs) while solving a problem five times larger than any 
previous run. 

6. FUTURE WORK 



The successful development of MPICH-G2 and its widespread adoption both 
make it a useful platform for future research and create significant interest in its 
continued development. 

One immediate area of concern is full support for MPI-2 features. In particular, 
support for dynamic process management will allow MPICH-G2 to be used for 
a wider class of Grid computations in which either application requirements or 
resource availability changes dynamically over time. The necessary support exists 
in the Globus Toolkit, and so this work depends primarily on the availability of 
the next-generation ADI-3. Less obvious, but very interesting, is how to integrate 
support for fault tolerance into MPICH-G2 in a meaningful way. 

A second area of concern relates to exploring and refining MPICH-G2 sup- 
port for application-level management of heterogeneity. Initial experiments with 
topology discovery and quality-of-service management have been encouraging, but 
it seems inevitable that application experiences will reveal deficiencies in current 
techniques or suggest additional MPICH-G2 support that could further improve 
application flexibility. 

Our work on collective operations can be improved in various ways. In particular, 
van de Geijn et al. || have shown that there are advantages in implementing collective 
operations by segmenting and pipelining messages when communicating over 
relatively slow channels (e.g., TCP over local- and wide-area networks). These 
pipelining techniques can be used throughout many of the levels in MPICH-G2's 
multilevel topology-aware collective operations. 
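
The benefit of segmenting and pipelining can be seen in a simple textbook cost model. The sketch below is not the method of van de Geijn et al. or of MPICH-G2; it assumes a chain of store-and-forward hops, each costing a fixed latency plus a per-byte transfer time, and all function and parameter names are hypothetical.

```python
def store_and_forward_time(hops: int, message_bytes: int,
                           latency: float, seconds_per_byte: float) -> float:
    """Time to relay one unsegmented message across a chain of hops."""
    return hops * (latency + message_bytes * seconds_per_byte)


def pipelined_time(hops: int, message_bytes: int, segments: int,
                   latency: float, seconds_per_byte: float) -> float:
    """Time to relay the same message split into equal segments.

    The first segment needs `hops` steps to arrive; each remaining segment
    arrives one step later, giving hops + segments - 1 steps in total.
    """
    segment_bytes = message_bytes / segments
    return (hops + segments - 1) * (latency + segment_bytes * seconds_per_byte)


def best_segment_count(hops: int, message_bytes: int, latency: float,
                       seconds_per_byte: float, max_segments: int = 1024) -> int:
    """Search for the segment count that minimizes the modeled time."""
    return min(
        range(1, max_segments + 1),
        key=lambda s: pipelined_time(hops, message_bytes, s,
                                     latency, seconds_per_byte),
    )
```

For example, with three hops, a one-megabyte message, 1 ms of latency, and 0.1 microseconds per byte, the model gives roughly 0.30 s unsegmented and roughly 0.13 s with ten segments. Too many segments eventually hurt because each one pays the latency term, which is why the model has an interior optimum.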

7. RELATED WORK 

A variety of approaches to programming Grid applications have been proposed, 
including object systems (|(| |2^| , Web technologies J22[ p0| , problem solving en- 
vironments @, ||, CORBA, workflow systems, high-throughput computing sys- 
tems pL |9), and compiler-based systems J33|. We assume that while different 
technologies will prove attractive for different purposes, a programming model such 
as MPI that allows direct control over low-level communications will always be 
attractive for certain applications. 

Other systems that support message passing in heterogeneous environments 
include the pioneering Parallel Virtual Machine (PVM) Q ||| and the PACX- 
MPI ||, MetaMPI Q, and STAMPI {|§ implementations of MPI, each of which 
addresses issues relating to efficient communication in heterogeneous wide-area sys- 
tems. STAMPI supports MPI-2 dynamic process management features. PACX- 
MPI, like MPICH-G2, supports the automatic startup of distributed computations 
but uses ssh, rather than the GRAM protocol with its integrated GSI authentication, 
for that purpose; it also does not address executable staging. PACX-MPI and 
STAMPI also differ from MPICH-G2 in how they address wide-area communication. 
Whereas in MPICH-G2 any processor may speak both local and wide-area communication 
protocols, PACX-MPI and STAMPI forward all off-cluster communication operations 
to an intermediate gateway node. 

Other implementations of MPI include MPICH with the ch_p4 device and 
LAM/MPI p7| . In contrast to MPICH-G2, these implementations were designed for 
local-area networks rather than computational Grids. 

The Interoperable MPI (IMPI) standards effort jn) defines standard message 
formats and protocols with a view to enabling interoperability among different MPI 
implementations. IMPI does not address issues of computation management and 
control; in principle, the techniques developed within MPICH-G2 could be used for 
that purpose. 

Other related projects include MagPIe p| and MPI-StarT [p0|, which show 
how careful consideration of communication topologies can result in significant im- 
provements after modifying the MPICH broadcast algorithm, which uses topology- 
unaware binomial trees. However, both limit their view of the network to only two 
layers; processors are either near or far. Further performance improvements can 
be realized by adopting the multilevel network view. We referred in the preceding 
section to the work of van de Geijn et al. ||. In [34], Kielmann et al. have extended 
MagPIe by incorporating van de Geijn's pipelining idea through a technique they 
call Parameterized LogP (PLogP), which is an extension of the LogP model presented 
by Culler et al. |j). In this extension, MagPIe still recognizes only a two-layer 
communication network, but through parameterized studies of the network they 
determine "optimal" packet sizes. 
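
The cost of a topology-unaware broadcast can be made concrete by counting how many messages cross between sites. The sketch below is illustrative only: it uses one common binomial-tree construction and a simple two-level scheme in the spirit of the near/far view discussed above, not the actual MPICH, MagPIe, or MPICH-G2 algorithms, and all names are hypothetical. The counts depend on how ranks are assigned to sites.

```python
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[int, int]  # (sender rank, receiver rank)


def binomial_edges(n: int) -> List[Edge]:
    """Edges of a binomial broadcast tree over ranks 0..n-1, rooted at rank 0."""
    edges: List[Edge] = []
    span = 1
    while span < n:
        # In this round every rank that already has the data sends it
        # to the rank `span` positions above it.
        for sender in range(span):
            receiver = sender + span
            if receiver < n:
                edges.append((sender, receiver))
        span *= 2
    return edges


def two_level_edges(site_of: Sequence[str]) -> List[Edge]:
    """Rank 0 sends once to a leader at each remote site; leaders broadcast locally."""
    members: Dict[str, List[int]] = {}
    for rank, site in enumerate(site_of):
        members.setdefault(site, []).append(rank)
    edges: List[Edge] = []
    for site, ranks in members.items():
        if site != site_of[0]:
            edges.append((0, ranks[0]))  # the only wide-area message for this site
        for sender, receiver in binomial_edges(len(ranks)):
            edges.append((ranks[sender], ranks[receiver]))
    return edges


def wide_area_messages(edges: Sequence[Edge], site_of: Sequence[str]) -> int:
    """Count the edges whose endpoints are at different sites."""
    return sum(1 for s, r in edges if site_of[s] != site_of[r])
```

With eight ranks split into two sites of four (ranks 0 to 3 at one site, 4 to 7 at the other), this binomial tree sends four messages across the wide-area link, while the two-level scheme sends one; both deliver the data with seven messages in total.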

Various projects have investigated programming model extensions to enable ap- 
plication management of QoS, for example, Quo pOfl . The only other relevant effort 
in the context of MPI is work on real-time extensions to MPI. MPI/RT j44|] provides 
a QoS interface but is not an established standard and introduces a new programming 
interface. Furthermore, its focus is on real-time needs, such as predictability 
of performance and system resource usage, that are more appropriate for embedded 
systems than for wide-area networks. 

8. SUMMARY 

We have described MPICH-G2, an implementation of the Message Passing In- 
terface that uses Globus Toolkit mechanisms to support the execution of MPI 
programs in heterogeneous wide-area environments. MPICH-G2 masks details of 
underlying networks, software systems, policies, and computer architectures so that 
diverse distributed resources can appear as a single MPI_COMM_WORLD. Arbitrary MPI 
applications can be started on heterogeneous collections of machines simply by typ- 
ing mpirun: authentication, authorization, executable staging, resource allocation, 
job creation, startup, and routing of stdout and stderr are all handled automat- 
ically via Globus Toolkit mechanisms. MPICH-G2 also enables the use of MPI 
features for user-level management of heterogeneity, for example, via the use of 
MPI's communicator construct to access system topology information. A wide 
range of successful application experiences have demonstrated MPICH-G2's util- 
ity in practical settings, both for traditional simulation applications and for less 
traditional applications such as distributed visualization pipelines. 

While MPICH-G2 is already a sophisticated tool that is seeing widespread use, 
there are also several areas in which it can be extended and improved. Support 
for MPI-2 features, in particular dynamic process management, will be invaluable 
for Grid applications that adapt their resource usage to changing conditions and 
application requirements. This support will be provided as soon as it is incorpo- 
rated into MPICH. More challenging is the design of techniques for effective fault 
management, a major topic for future research. Here we may be able to draw upon 
techniques developed within systems such as PVM |^5| . 